Abstract


Orthographic similarities across languages provide a strong signal for probabilistic decipherment, especially for closely related language pairs. The existing decipherment models, however, are not well-suited for exploiting these orthographic similarities. We propose a log-linear model with latent variables that incorporates orthographic similarity features. Maximum likelihood training is computationally expensive for the proposed log-linear model. To address this challenge, we perform approximate inference via MCMC sampling and contrastive divergence. Our results show that the proposed log-linear model with contrastive divergence scales to large vocabularies and outperforms the existing generative decipherment models by exploiting the orthographic features.


1 Introduction 


Word-level translation models are typically learned by applying statistical word alignment algorithms on large bilingual parallel corpora (Brown et al., 1993). However, building a parallel corpus is expensive, and data is limited or even unavailable for many language pairs. On the other hand, large monolingual corpora can be easily downloaded from the internet for most languages. Decipherment algorithms exploit such monolingual corpora in order to learn translation model parameters, when parallel data is limited or unavailable (Koehn and Knight, 2000; Ravi and Knight, 2011; Dou et al., 2014).

Existing decipherment methods are predominantly based on probabilistic generative models (Koehn and Knight, 2000; Ravi and Knight, 2011; Nuhn and Ney, 2014; Dou and Knight, 2012). These models exploit the statistical similarities between the n-gram frequencies in the source and the target language, and rely on the Expectation Maximization (EM) algorithm (Dempster et al., 1977) or its faster approximations. These existing models, however, do not allow incorporating linguistically motivated features. Previous research has shown the effectiveness of incorporating linguistically motivated features for many different unsupervised learning tasks, such as unsupervised part-of-speech induction (Berg-Kirkpatrick et al., 2010; Haghighi and Klein, 2006), word alignment (Ammar et al., 2014; Dyer et al., 2011), and grammar induction (Berg-Kirkpatrick et al., 2010). In this paper, we present a feature-rich log-linear model for probabilistic decipherment.


Words in different languages are often derived from the same source, or borrowed from other languages with minor variations, resulting in substantial phonetic and lexical similarities. As a result, orthographic features provide crucial information for determining word-level translations for closely related language pairs. Haghighi et al. (2008) proposed a generative model for inducing a bilingual lexicon from monolingual text by exploiting orthographic and contextual similarities among the words in two different languages. The model proposed by Haghighi et al. learns a one-to-one mapping between the words in two languages by analyzing type-level features only, while ignoring the token-level frequencies. We propose a decipherment model that unifies the type-level feature-based approach of Haghighi et al. with the token-level EM-based approaches (Koehn and Knight, 2000; Ravi and Knight, 2011).


One of the key challenges with the proposed latent variable log-linear models is the high computational complexity of training, as it requires “normalizing globally” via summing over all possible observations and latent variables. We perform approximate inference using Markov Chain Monte Carlo (MCMC) sampling for scalable training of the log-linear decipherment models. The main contributions of this paper are:

• We propose a feature-based decipherment model that combines both type-level orthographic features and token-level distributional similarities. Our proposed model outperforms the existing EM-based decipherment models.

• We apply three different MCMC sampling strategies for scalable training and compare them in terms of running time and accuracy. Our results show that Contrastive Divergence (Hinton, 2002) based MCMC sampling can dramatically improve the speed of the training, while achieving comparable accuracy.


Symbol    Meaning
$N_f$     Number of unique source bigrams
$N_e$     Number of unique target bigrams
$V_f$     Source vocabulary
$V_e$     Target vocabulary
$V$       $\max(|V_f|, |V_e|)$
$n$       Number of samples
$K$       Beam size for precomputed lists
$\phi$    Unigram-level feature function
$\Phi$    Bigram-level feature function: $\Phi = \phi_1 + \phi_2$

Table 1: Our notations and symbols.


2 Problem Formulation

Given a source text $\mathcal{F}$ and an independent target corpus $\mathcal{E}$, our goal is to decipher the source text $\mathcal{F}$ by learning the mapping between the words in the source and the target language. Although the sentences in the source and target corpus are independent of each other, there exist distributional and lexical similarities among the words of the two languages. We aim to automatically learn the translation probabilities $p(f \mid e)$ by exploiting the similarities between the bigrams in $\mathcal{F}$ and $\mathcal{E}$.

As a simplification step, we break down the sentences in the source and target corpus into a collection of bigrams. Let $\mathcal{F}$ contain a collection of source bigrams $f_1 f_2$, and $\mathcal{E}$ contain a collection of target bigrams $e_1 e_2$. Let the source and target vocabulary be $V_f$ and $V_e$ respectively. Let $N_f$ and $N_e$ be the number of unique bigrams in $\mathcal{F}$ and $\mathcal{E}$ respectively. We assume that the corpus $\mathcal{F}$ is an encrypted version of a plaintext in the target language. Each source word $f \in V_f$ is obtained by substituting one of the words $e \in V_e$ in the plaintext. However, the mappings between the words in the two languages are unknown, and are learned as latent variables.

3 Background Research

Existing decipherment models assume that each source bigram $f_1 f_2$ in $\mathcal{F}$ is generated by first generating a target bigram $e_1 e_2$ according to the target language model, and then substituting $e_1$ and $e_2$ with $f_1$ and $f_2$ respectively. The generative process is typically modeled via a Hidden Markov Model (HMM) as shown in Figure 1(a). The target bigram language model $p(e_1 e_2)$ is trained from the given monolingual target corpus $\mathcal{E}$. The translation probabilities $p(f \mid e)$ are unknown, and learned by maximizing the likelihood of the observed source corpus $\mathcal{F}$:

$$P(\mathcal{F}) = \prod_{f_1 f_2 \in \mathcal{F}} P(f_1 f_2) = \prod_{f_1 f_2 \in \mathcal{F}} \sum_{e_1 e_2} p(e_1 e_2)\, p(f_1 \mid e_1)\, p(f_2 \mid e_2), \tag{1}$$

where $e_1$ and $e_2$ are the latent variables, indicating the target words in $V_e$ corresponding to $f_1$ and $f_2$ respectively. The log-likelihood function with latent variables is non-convex, and several methods have been proposed for maximizing it.
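As an illustration of the likelihood in Equation 1, the following minimal sketch evaluates the log-likelihood of a list of source bigrams by summing over the target bigrams that have non-zero language model probability. The function name, the dictionary layout (`lm` keyed by target bigram, `trans` keyed by `(f, e)` and holding $p(f \mid e)$), and the toy Spanish/English probabilities are all assumptions made for this example, not values from our experiments.

```python
import math


def bigram_log_likelihood(source_bigrams, lm, trans):
    """log P(F) = sum over f1 f2 of log sum_{e1 e2} p(e1 e2) p(f1|e1) p(f2|e2)."""
    total = 0.0
    for f1, f2 in source_bigrams:
        marginal = 0.0
        # Only target bigrams with non-zero LM probability contribute to the sum.
        for (e1, e2), p_lm in lm.items():
            marginal += p_lm * trans.get((f1, e1), 0.0) * trans.get((f2, e2), 0.0)
        total += math.log(marginal)
    return total


# Hypothetical toy model: trans[(f, e)] stores p(f | e).
lm = {("the", "house"): 0.6, ("house", "the"): 0.4}
trans = {
    ("la", "the"): 0.9, ("casa", "the"): 0.1,
    ("la", "house"): 0.2, ("casa", "house"): 0.8,
}
ll = bigram_log_likelihood([("la", "casa")], lm, trans)
print(ll)
```

For the single bigram "la casa", the marginal is 0.6 · 0.9 · 0.8 + 0.4 · 0.2 · 0.1 = 0.44, so the sketch prints log 0.44.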


3.1 Expectation-Maximization (EM) 


The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) has been widely applied for solving the decipherment problem (Knight and Yamada, 1999; Koehn and Knight, 2000). In the E-step, for each source bigram $f_1 f_2$, we estimate the expected counts of the latent variables $e_1$ and $e_2$ over all the target words in $V_e$. In the M-step, the expected counts are normalized to obtain the translation probabilities $p(f \mid e)$. The computational complexity of the EM algorithm is $O(N_f V^2)$ and the memory complexity is $O(V^2)$, where $N_f$ is the number of unique bigrams in $\mathcal{F}$ and $V = \max(|V_f|, |V_e|)$. As a result, the regular EM algorithm is prohibitively expensive for large vocabulary sizes, both in terms of running time and memory consumption.
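The E-step and M-step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in prior work: the names `em_iteration`, `bigram_counts`, `lm`, and `trans` are hypothetical, `trans[(f, e)]` stores $p(f \mid e)$, and the inner loop visits only target bigrams present in the language model rather than all $V^2$ pairs.

```python
from collections import defaultdict


def em_iteration(bigram_counts, lm, trans):
    """One EM iteration for the translation table p(f | e)."""
    expected = defaultdict(float)
    for (f1, f2), count in bigram_counts.items():
        # E-step: unnormalized posterior over the latent target bigram (e1, e2).
        joint = {}
        for (e1, e2), p_lm in lm.items():
            score = p_lm * trans.get((f1, e1), 0.0) * trans.get((f2, e2), 0.0)
            if score > 0.0:
                joint[(e1, e2)] = score
        z = sum(joint.values())
        if z == 0.0:
            continue  # this bigram cannot be explained by the current model
        for (e1, e2), score in joint.items():
            posterior = count * score / z
            expected[(f1, e1)] += posterior
            expected[(f2, e2)] += posterior
    # M-step: normalize the expected counts separately for each target word e.
    totals = defaultdict(float)
    for (f, e), c in expected.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in expected.items()}


# Hypothetical toy problem with a uniform initial translation table.
lm = {("the", "house"): 0.9, ("house", "the"): 0.1}
trans = {(f, e): 0.5 for f in ("la", "casa") for e in ("the", "house")}
new_trans = em_iteration({("la", "casa"): 10}, lm, trans)
print(new_trans[("la", "the")])
```

After one iteration on this toy input, the language model alone pulls $p(\text{la} \mid \text{the})$ from 0.5 towards 0.9, which is the distributional signal that EM-based decipherment relies on.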

To address this challenge, Ravi and Knight (2011) proposed the Iterative EM algorithm, which starts with the $K$ most frequent words from $\mathcal{F}$ and $\mathcal{E}$ and performs EM-based decipherment. Next, the source and target vocabularies are iteratively extended by $K$ new words, while pruning low probability entries from the probability table. The computational complexity of each iteration becomes $O(N_f K^2)$.


3.2 Bayesian Decipherment using Gibbs Sampling

Ravi and Knight (2011) proposed a Gibbs sampling based Bayesian decipherment strategy. For each observed source bigram $f_1 f_2$, the Gibbs sampling approach starts with an initial target bigram $e_1 e_2$, and alternately fixes one of the target words and replaces the other with a randomly chosen sample. When $e_1$ is fixed, a new sample $e_2^{new}$ is drawn from the probability distribution $p(e_1 e_2^{new})\, p(f_2 \mid e_2^{new})$. Next, we fix $e_2$ and sample $e_1^{new}$, and continue alternating until $n$ samples are collected. Bayesian decipherment reduces memory consumption via Gibbs sampling. The probability table remains sparse, since only a small number of word pairs $(f, e)$ will be observed together in the samples.


3.3 Slice Sampling 

To draw each sample via Gibbs sampling, we need to estimate the probabilities of choosing each target word $e \in V_e$, which requires $O(V)$ operations. To address this issue, Dou et al. (2012) proposed a slice sampling approach with precomputed top-$K$ lists. Similar to Gibbs sampling, for each source bigram $f_1 f_2$, the slice sampling approach starts with one initial target bigram $e_1 e_2$, and alternately replaces either $e_1$ or $e_2$ while keeping the other one fixed. In order to replace $e_1$ with a new sample $e_1^{new}$, we sample a random threshold $T$ uniformly between 0 and $p(e_1 e_2)\, p(f_1 \mid e_1)$. Next, we uniformly sample an $e_1^{new}$ from all the candidates $e_1'$ such that $p(e_1' e_2)\, p(f_1 \mid e_1') > T$. While sampling $T$ is straightforward, the second sampling stage requires finding all the candidates, which again takes $O(V)$ computation. Dou et al. (2012) addressed this challenge by precomputing sorted top-$K$ word lists for both $p(f \mid e)$ and $p(e_1, e_2)$. While sampling $e_1$, it tries to generate all the candidates by looking only at the top-$K$ list for $p(e_1' \mid f_1)$ and the top-$K$ list for $p(e_1' e_2)$. Even though slice sampling with top-$K$ lists is faster than Gibbs sampling on average, sometimes the top-$K$ lists fail to provide all the candidates, and it needs to fall back to sampling from the entire vocabulary, which requires $O(V)$ operations.
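The two sampling stages of slice sampling can be illustrated with the short sketch below. It deliberately implements only the fallback path that scans the entire vocabulary in $O(V)$; the precomputed top-$K$ lists of Dou et al. are not reproduced here. The function name, argument layout, and toy probabilities are assumptions for illustration only.

```python
import random


def slice_sample_e1(f1, e1, e2, lm, trans, target_vocab, rng):
    """Replace e1 by slice sampling while keeping e2 fixed (full-vocabulary scan)."""

    def score(candidate):
        # p(e1' e2) * p(f1 | e1'); trans[(f, e)] stores p(f | e).
        return lm.get((candidate, e2), 0.0) * trans.get((f1, candidate), 0.0)

    # Stage 1: draw the threshold T uniformly between 0 and the current score.
    threshold = rng.uniform(0.0, score(e1))
    # Stage 2: choose uniformly among all candidates whose score exceeds T.
    candidates = [c for c in target_vocab if score(c) > threshold]
    if not candidates:
        return e1  # nothing lies above the threshold, so keep the current word
    return rng.choice(candidates)


# Hypothetical toy input in which only "the" has a non-zero score.
rng = random.Random(0)
lm = {("the", "house"): 0.5}
trans = {("la", "the"): 1.0}
sampled = slice_sample_e1("la", "the", "house", lm, trans, ["the", "house"], rng)
print(sampled)
```

In a faster variant, the list comprehension in stage 2 would be replaced by a lookup in sorted top-$K$ lists, with this full scan kept only as the fallback.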


3.4 Beam Search 


Nuhn et al. (2013; 2014) showed that Beam Search can significantly improve the speed of EM-based decipherment, while providing comparable or even slightly better accuracy. Beam search prunes less promising latent states by maintaining two constant-sized beams, one for the translation probabilities $p(f \mid e)$ and one for the target bigram probabilities $p(e_1 e_2)$, reducing the computational complexity to $O(N_f)$. Furthermore, it saves memory because many of the word pairs $(f, e)$ are never considered due to not being in the beam.


3.5 Feature-based Generative Models 


Feature-based representations have previously been explored under the generative setting. Haghighi et al. (2008) proposed a Canonical Correlation Analysis (CCA) based model for automatically learning the mapping between the words in two languages from monolingual corpora only. They exploited the orthographic and contextual features between the word types, but ignored the token-level frequencies. Ravi (2013) proposed a Bayesian decipherment model based on hash sampling, which takes advantage of feature-based similarities between source and target words. However, the feature representation was not integrated with their decipherment model, and was only used for efficiently sampling candidate target translations for each source word. Furthermore, the feature-based hash sampling included only contextual features, and did not consider orthographic features. In contrast, our log-linear model integrates both type-level orthographic features and token-level bigram frequencies.


4 Feature-based Decipherment 

Our feature-based decipherment model is based on a chain structured Markov Random Field (Figure 1(b)), which jointly models the observed source bigrams $f_1 f_2$ and the corresponding latent target bigrams $e_1 e_2$. For each source word $f \in V_f$, we have a latent variable $e \in V_e$ indicating the corresponding target word. The joint probability distribution is:

$$P(f_1 f_2, e_1 e_2) = \frac{1}{Z_w}\, p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right), \tag{2}$$

where $\Phi(f_1 f_2, e_1 e_2)$ is the feature function for the given source and target bigrams, $w$ is the vector of model parameters, and $Z_w$ is the normalization term. We assume that the bigram feature function decomposes linearly over the two unigrams:

$$\Phi(f_1 f_2, e_1 e_2) = \phi(f_1, e_1) + \phi(f_2, e_2). \tag{3}$$

The normalization term is:

$$Z_w = \sum_{f_1 f_2} \sum_{e_1 e_2} p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right).$$

The gradient of the joint log-likelihood is:

$$\frac{\partial L}{\partial w} = E_{e_1 e_2 \mid f_1 f_2}\left[\Phi(f_1 f_2, e_1 e_2)\right] - E_{f_1 f_2, e_1 e_2}\left[\Phi(f_1 f_2, e_1 e_2)\right] = E_{Forced} - E_{Full}.$$

Here, the first term is the expectation with respect to the empirical data distribution. We refer to it as the “Forced Expectation”, as the source text is assumed to be given. The second term is the expectation with respect to our model distribution, and is referred to as the “Full Expectation”. In theory, we can apply gradient descent or other off-the-shelf optimization techniques to optimize the conditional log-likelihood. However, exact estimation of the gradient is computationally expensive, as discussed in the next sub-sections.

[Figure 1, panels: (a) Hidden Markov Model (HMM); (b) Markov Random Field (MRF)]

Figure 1: The graphical models for the existing directed HMM and the proposed undirected MRF.

Method                      Complexity per Iteration
EM                          $O(N_f V^2)$
EM + Slice                  $O(N_f n V)$, but often faster
Log-linear Exact            $O(N_f V^2 + V^4)$
Log-linear + Gibbs          $O(N_f V n + V n^2)$
Log-linear + IMH + Gibbs    $O(N_f n + V n^2)$
Log-linear + CD             $O(N_f n)$

Table 2: The worst case computational complexities for different decipherment algorithms.

4.1 Estimating Forced Expectation

We estimate the forced expectation over latent variables using the following equation:

$$E_{Forced} = \sum_{f_1 f_2 \in \mathcal{F}} \frac{1}{Z(f_1 f_2)} \sum_{e_1 e_2 \in V_e^2} p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right) \Phi(f_1 f_2, e_1 e_2), \tag{4}$$

where $Z(f_1 f_2)$ is the normalization term given $f_1 f_2$:

$$Z(f_1 f_2) = \sum_{e_1 e_2 \in V_e^2} p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right).$$

For each observed $f_1 f_2 \in \mathcal{F}$, we need to sum over all possible $e_1 e_2 \in V_e^2$, which requires $O(N_f V^2)$ computation.

4.2 Estimating Full Expectation

For the full expectation, we assume that both the source text and latent variables are unknown. We estimate it by summing over all the possible source bigrams $f_1 f_2$, and associated latent variables $e_1 e_2$:

$$E_{Full} = \frac{1}{Z_g} \sum_{f_1 f_2 \in V_f^2} \sum_{e_1 e_2 \in V_e^2} p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right) \Phi(f_1 f_2, e_1 e_2), \tag{5}$$

where $Z_g$ is the global normalization term:

$$Z_g = \sum_{f_1 f_2 \in V_f^2} \sum_{e_1 e_2 \in V_e^2} p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right).$$

The computational complexity is $O(V^4)$.

5 MCMC Sampling for Faster Training

The overall computational complexity of estimating the exact gradient is $O(N_f V^2 + V^4)$, which is infeasible for decipherment even with a modest-sized vocabulary. Instead, we apply several different MCMC sampling methods to approximately estimate the forced and full expectations.
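To make the cost of exact training concrete, the following back-of-the-envelope calculation plugs the Hansard-1000 vocabulary sizes reported in Table 3 into the quartic term of the exact gradient. It counts summands only and is meant as an illustration, not as a measured quantity.

```latex
V = \max(|V_f|, |V_e|) = \max(3082, 2957) = 3082, \qquad
V^2 = 9{,}498{,}724, \qquad
V^4 = (V^2)^2 \approx 9.0 \times 10^{13}.
```

Even at one summand per nanosecond, a single exact evaluation of the full expectation at this vocabulary size would take on the order of a day, which motivates the sampling-based approximations in this section.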


5.1 Gibbs Sampling 

5.1.1 Gibbs Sampling for Approximating Forced Expectation

Instead of summing over all target bigrams $e_1 e_2$, we approximate the forced expectation by taking $n$ samples of $e_1 e_2$ for each observed $f_1 f_2$, and take an average of the features for these samples. For each observed $f_1 f_2$, the following steps are taken:

• Start with an initial target bigram $e_1 e_2$.

• Fix $e_2$ and sample $e_1^{new}$ according to the following probability distribution:

$$P(e_1^{new} \mid e_2, f_1 f_2) = \frac{1}{Z_{gibbs}}\, p(e_1^{new} e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1^{new} e_2)\right),$$

where

$$Z_{gibbs} = \sum_{e_1} p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right).$$

• Next, fix $e_1$ and draw a new sample $e_2^{new}$ similarly according to $P(e_2^{new} \mid e_1, f_1 f_2)$, and continue sampling $e_1$ and $e_2$ alternately until $n$ samples are drawn.

Drawing each sample requires $O(V)$ operations, as we need to estimate the normalization term $Z_{gibbs}$. The computational complexity of estimating the forced expectation becomes $O(N_f V n)$, which is expensive as $V$ can be large.
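The alternating procedure above can be sketched for a single observed source bigram as follows. This is a minimal illustration under simplifying assumptions: the only features are hypothetical translation indicators keyed as `("trans", f, e)`, the weights and language model are toy values, and every target bigram has non-zero language model probability so that the chain can move freely.

```python
import math
import random
from collections import Counter


def bigram_features(f1, f2, e1, e2):
    """Phi(f1 f2, e1 e2) = phi(f1, e1) + phi(f2, e2), with indicator features only."""
    feats = Counter()
    feats[("trans", f1, e1)] += 1.0
    feats[("trans", f2, e2)] += 1.0
    return feats


def dot(w, feats):
    return sum(w.get(k, 0.0) * v for k, v in feats.items())


def gibbs_forced_expectation(f1, f2, lm, w, vocab, n, rng):
    """Average the features of n Gibbs samples of (e1, e2) for one source bigram."""
    e1, e2 = rng.choice(vocab), rng.choice(vocab)
    expectation = Counter()
    for step in range(n):
        resample_first = step % 2 == 0  # alternate between e1 and e2
        weights = []
        for c in vocab:  # O(V) work per sample: this loop computes Z_gibbs implicitly
            cand = (c, e2) if resample_first else (e1, c)
            score = dot(w, bigram_features(f1, f2, *cand))
            weights.append(lm.get(cand, 0.0) * math.exp(score))
        if sum(weights) > 0.0:
            chosen = rng.choices(vocab, weights=weights, k=1)[0]
            if resample_first:
                e1 = chosen
            else:
                e2 = chosen
        for k, v in bigram_features(f1, f2, e1, e2).items():
            expectation[k] += v / n
    return expectation


# Hypothetical toy model.
rng = random.Random(0)
lm = {("the", "house"): 0.6, ("house", "the"): 0.2,
      ("the", "the"): 0.1, ("house", "house"): 0.1}
w = {("trans", "la", "the"): 2.0, ("trans", "casa", "house"): 2.0}
expectation = gibbs_forced_expectation("la", "casa", lm, w, ["the", "house"], n=50, rng=rng)
print(expectation.most_common(2))
```

The loop over `vocab` inside each step is the $O(V)$ cost per sample discussed above; it is exactly the part that the proposal-based samplers in the following sections avoid.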

5.1.2 Gibbs Sampling for Approximating Full Expectation

To efficiently estimate the full expectation, we sample $n$ source bigrams $f_1 f_2$ from our model. The Gibbs sampling procedure is:

• Start with an initial random $f_1 f_2$.

• Fix $f_2$, and sample a new $f_1$ according to $P(f_1 \mid f_2)$:

$$P(f_1 \mid f_2) = \frac{1}{Z_{gibbs}} \sum_{e_1 e_2} p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right),$$

where

$$Z_{gibbs} = \sum_{f_1} \sum_{e_1 e_2} p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right).$$

• Next, fix $f_1$ and sample $f_2$ according to $P(f_2 \mid f_1)$. Continue alternating until $n$ samples are drawn.

The computational complexity of exactly estimating $P(f_1 \mid f_2)$ is $O(V^3)$, resulting in the computational complexity $O(V^3 n)$, which is infeasible. However, instead of summing over all possible $e_1 e_2$, we can approximate via sampling. For each $f_1 f_2$, we first sample $n$ samples $e_1 e_2$ according to $p(e_1 e_2)$. Let $S$ be the set of $n$ samples of target bigrams. Next, we approximate $P(f_1 \mid f_2)$ as:

$$\hat{P}(f_1 \mid f_2) = \frac{1}{Z_{approx}} \sum_{e_1 e_2 \in S} \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right),$$

where

$$Z_{approx} = \sum_{f_1} \sum_{e_1 e_2 \in S} \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right).$$

This reduces the computational complexity to $O(V n^2)$.

5.2 Independent Metropolis Hastings (IMH) 

The Gibbs sampling for our log-linear model is slow as it requires normalizing the sampling probabilities over the entire vocabulary. To address this challenge, we apply Independent Metropolis Hastings (IMH) sampling, which relies on a proposal distribution and does not require normalization. However, finding an appropriate proposal distribution can sometimes be challenging, as it needs to be close to the true distribution for faster mixing and must be easy to sample from.

For the forced expectation, one possibility is to use the bigram language model $p(e_1 e_2)$ as a proposal distribution. However, the bigram language model did not work well in practice. Since $p(e_1 e_2)$ does not depend on $f_1 f_2$, it resulted in slow mixing and exhibited a bias towards highly frequent target words.

Instead, we chose an approximation of $P(e_1 e_2 \mid f_1 f_2)$ as our proposal distribution. To simplify sampling, we assume $e_1$ and $e_2$ to be independent of each other for any given $f_1 f_2$. Therefore, the proposal distribution is $q(e_1 e_2 \mid f_1 f_2) = q_u(e_1 \mid f_1)\, q_u(e_2 \mid f_2)$, where $q_u(e \mid f)$ is a probability distribution over target unigrams for a given source unigram. We define $q_u(e \mid f)$ as follows:

$$q_u(e \mid f) = (1 - p_b)\, q_s(e \mid f) + p_b \frac{1}{|V_e|},$$

where $p_b$ is a small back-off probability with which we fall back to the uniform distribution over target unigrams. The other term $q_s(e \mid f)$ is a distribution over the target words $e$ for which $(f, e) \in w$:

$$q_s(e \mid f) = \begin{cases} \frac{1}{Z_{imh}} \exp\left(w^\top \phi(f, e)\right), & \text{if } (f, e) \in w \\ 0, & \text{otherwise.} \end{cases}$$

Here, $Z_{imh}$ is a normalization term over all the $e$ such that $(f, e) \in w$. The weight vector $w$ is sparse, as only a small number of translation features $(f, e)$ (Section 6) are observed during sampling. Furthermore, we update $q_s$ only once every 5 iterations of gradient descent.

The actual target distribution is:

$$P(e_1 e_2 \mid f_1 f_2) \propto p(e_1 e_2) \exp\left(w^\top \Phi(f_1 f_2, e_1 e_2)\right). \tag{6}$$

For each $f_1 f_2 \in \mathcal{F}$, we take the following steps during sampling:

• Start with an initial English bigram $(e_1 e_2)^0$.

• Let the current sample be $(e_1 e_2)^i$. Next, sample $(e_1 e_2)^{i+1}$ from the proposal distribution $q(e_1 e_2 \mid f_1 f_2)$.

• Accept the new sample with the probability:

$$P_a = \frac{P((e_1 e_2)^{i+1} \mid f_1 f_2)\; q((e_1 e_2)^{i} \mid f_1 f_2)}{P((e_1 e_2)^{i} \mid f_1 f_2)\; q((e_1 e_2)^{i+1} \mid f_1 f_2)}.$$

The IMH sampling reduces the complexity of the forced expectation estimation to $O(N_f n)$ (ignoring the cost of estimating $q_s(e \mid f)$, which occurs only once every 5 iterations), which is significantly less than the complexity of $O(N_f V n)$ in the case of Gibbs sampling. However, we could not apply IMH while estimating the full expectation, as finding a suitable proposal distribution is more complicated. Therefore, the overall complexity remains $O(N_f n + V n^2)$.


5.3 Contrastive Divergence Based Sampling

The main reason for the slow training of the proposed log-linear model is the high computational cost of estimating the partition function $Z_g$ of our MRF model when estimating the full expectation. A similar problem arises while training deep neural networks. An increasingly popular technique to address this issue is to perform Contrastive Divergence (Hinton, 2002), which allows us to avoid estimating the partition function.

For each observed source bigram $f_1 f_2 \in \mathcal{F}$, the contrastive divergence sampling procedure works as follows:

• Sample a target bigram $e_1 e_2$ according to the distribution $P(e_1 e_2 \mid f_1 f_2)$. We perform this step using Independent Metropolis Hastings, as discussed in the previous section.

• Sample a reconstructed source bigram $(f_1 f_2)^{recon}$ by sampling from the distribution $P(f_1 f_2 \mid e_1 e_2)$, again via Independent Metropolis Hastings.

We take $n$ such samples of $e_1 e_2$ and corresponding $(f_1 f_2)^{recon}$. For each sample and reconstruction pair, we update the weight vector by an approximation of the gradient:

$$\frac{\partial L}{\partial w} \approx \Phi((f_1 f_2)^{data}, e_1 e_2) - \Phi((f_1 f_2)^{recon}, e_1 e_2),$$

where $(f_1 f_2)^{data}$ denotes the observed source bigram.
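The IMH acceptance rule of Section 5.2 and the contrastive divergence update can be sketched together as follows. Several simplifications are assumptions of this example rather than part of our system: both proposal distributions are uniform (instead of the back-off mixture $q_u$), the acceptance ratio is capped at 1 as is standard for Metropolis-Hastings, features are translation indicators only, and the names `imh_chain`, `cd_update`, `steps`, and `lr` are hypothetical.

```python
import math
import random
from collections import Counter


def phi_bigram(f1, f2, e1, e2):
    """Phi(f1 f2, e1 e2) with hypothetical translation indicator features."""
    feats = Counter()
    feats[("trans", f1, e1)] += 1.0
    feats[("trans", f2, e2)] += 1.0
    return feats


def dot(w, feats):
    return sum(w.get(k, 0.0) * v for k, v in feats.items())


def imh_chain(start, propose, proposal_prob, target_score, steps, rng):
    """Independent Metropolis-Hastings: proposals ignore the current state."""
    current = start
    for _ in range(steps):
        cand = propose(rng)
        num = target_score(cand) * proposal_prob(current)
        den = target_score(current) * proposal_prob(cand)
        if den > 0.0:
            accept_prob = min(1.0, num / den)
        else:
            accept_prob = 1.0 if num > 0.0 else 0.0  # leave zero-probability states
        if rng.random() < accept_prob:
            current = cand
    return current


def cd_update(f1, f2, lm, w, src_vocab, tgt_vocab, n, lr, rng, steps=5):
    """Apply n contrastive divergence updates for one observed source bigram."""
    q_e = 1.0 / len(tgt_vocab) ** 2  # uniform proposal over target bigrams
    q_f = 1.0 / len(src_vocab) ** 2  # uniform proposal over source bigrams
    for _ in range(n):
        # Step 1: e1 e2 ~ P(e1 e2 | f1 f2), proportional to p(e1 e2) exp(w . Phi).
        e = imh_chain(
            (rng.choice(tgt_vocab), rng.choice(tgt_vocab)),
            lambda r: (r.choice(tgt_vocab), r.choice(tgt_vocab)),
            lambda s: q_e,
            lambda s: lm.get(s, 0.0) * math.exp(dot(w, phi_bigram(f1, f2, *s))),
            steps, rng)
        # Step 2: reconstruction ~ P(f1 f2 | e1 e2), proportional to exp(w . Phi).
        recon = imh_chain(
            (f1, f2),  # the chain starts at the observed data
            lambda r: (r.choice(src_vocab), r.choice(src_vocab)),
            lambda s: q_f,
            lambda s: math.exp(dot(w, phi_bigram(s[0], s[1], *e))),
            steps, rng)
        # Gradient approximation: Phi(data, e) - Phi(reconstruction, e).
        grad = phi_bigram(f1, f2, *e)
        grad.subtract(phi_bigram(recon[0], recon[1], *e))
        for k, v in grad.items():
            if v != 0.0:
                w[k] = w.get(k, 0.0) + lr * v
    return w


rng = random.Random(0)
lm = {("the", "house"): 0.6, ("house", "the"): 0.2,
      ("the", "the"): 0.1, ("house", "house"): 0.1}
w = cd_update("la", "casa", lm, {}, ["la", "casa"], ["the", "house"],
              n=20, lr=0.1, rng=rng)
print(sorted(w.items()))
```

No normalization term appears anywhere in the sketch: both chains only evaluate unnormalized scores, which is what keeps the per-bigram cost independent of the vocabulary size.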

6 Feature Design 

We included the following unigram-level features:

• Translation Features: each $(f, e)$ word pair, where $f \in V_f$ and $e \in V_e$, is a potential feature in our model. While there are $O(V^2)$ such possible features, we only include the ones that are observed during sampling. Therefore, our feature weight vector $w$ is sparse, with most of its entries zero.

• Orthographic Features: we incorporated an orthographic feature based on the normalized edit distance. For a word pair $(e, f)$, the orthographic feature is triggered if the normalized edit distance is less than a threshold (set to 0.3 in our experiments).

The set of features can further be extended by including context window based features (Haghighi et al., 2008; Ravi, 2013) and topic features.
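The orthographic feature described above can be illustrated with the short sketch below. The normalization is an assumption of this example: the Levenshtein distance is divided by the length of the longer word, which is one common choice and may differ from the exact normalization used in our experiments. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def orthographic_match(f, e, threshold=0.3):
    """True if the normalized edit distance between f and e is below the threshold."""
    longest = max(len(f), len(e))
    if longest == 0:
        return False
    return edit_distance(f, e) / longest < threshold


for f, e in [("excelente", "excellent"), ("madre", "made"), ("casa", "house")]:
    print(f, e, edit_distance(f, e), orthographic_match(f, e))
```

Under this normalization the pair "madre"/"made" also triggers the feature (distance 1 over length 5), which shows how an orthographic feature can favor a wrong but similar-looking target word.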


7 Experiments and Results 
7.1 Datasets 


We experimented with two closely related language pairs: (1) Spanish and English and (2) French and English. For Spanish/English, we experimented with a subset of the OPUS Subtitle corpus (Tiedemann, 2009). For French/English, we used the Hansard corpus (Brown et al., 1991), containing parallel French and English text from the proceedings of the Canadian Parliament. In order to have a non-parallel setup, we extracted monolingual text from different sections of the French and English text. Detailed descriptions of the two datasets follow Table 3.

Dataset         Num. Sentences          $|V_e|$   $|V_f|$
OPUS            19.77K (1128 unique)    579       411
Hansard-100     100                     358       371
Hansard-1000    1000                    2957      3082

Table 3: Statistics on the datasets used in our experiments.

OPUS Subtitle Dataset: The OPUS dataset is a smaller pre-processed subset of the original larger OPUS Spanish/English parallel corpora. The dataset consists of short sentences in Spanish and English, each of which is a movie subtitle. The same dataset has been used in several previous decipherment experiments (Ravi and Knight, 2011; Nuhn and Ney, 2014; Ravi, 2013).

Hansard Dataset: The Hansard dataset contains parallel text from the Canadian Parliament Proceedings. We experimented with two datasets:

• Hansard-100: The French text consists of the first 100 sentences and the English text consists of the second 100 sentences.

• Hansard-1000: The French text consists of the first 1000 sentences and the English text consists of the second 1000 sentences.

Table 3 provides some statistics on the three datasets used in our experiments. Due to the relatively small vocabulary size of the OPUS and Hansard-100 datasets, we were able to run all 4 versions of the log-linear model and compare with the exact EM-based decipherment. The Hansard-1000 dataset, however, is too large to run the exact EM and some of the inexact log-linear models (e.g., Gibbs sampling and IMH + Gibbs). As a result, we only applied the fastest log-linear model with contrastive divergence on the Hansard-1000 dataset.


7.2 Evaluation 

We evaluate the accuracy of decipherment by the percentage of source words that are mapped to the correct target translation. The correct translation for each source word was determined automatically using the Google Translation API. While the Google Translation API did a fair job of translating the French and Spanish words to English, it returned only a single target translation. We noticed occasional cases where the decipherment algorithm retrieved the correct translation, but it did not get the credit because of not matching with the translation from the API.

Additionally, we performed Viterbi decoding on the sentences in a small held-out test corpus from the OPUS dataset, and compared the BLEU scores with the previously published results on the same training and test sets (Ravi and Knight, 2011; Nuhn and Ney, 2014; Ravi, 2013).

Method                                   BLEU (%)
EM (Ravi and Knight, 2011)               15.3
EM + Beam Search (Nuhn and Ney, 2014)    15.7
Log-linear + Gibbs                       18.9
Log-linear + IMH                         18.8
Log-linear + CD                          18.8

Table 5: Comparison of MT performance on the OPUS dataset using bigram language model.
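The word-level accuracy used in this section can be sketched as follows. The function name and the dictionary-based representation of the predicted and reference lexicons are assumptions for illustration; the example also reflects the single-reference limitation noted above, since a prediction counts only if it equals the one reference translation.

```python
def decipherment_accuracy(predicted, reference):
    """Percentage of reference source words mapped to the reference translation."""
    if not reference:
        return 0.0
    correct = sum(1 for f, gold in reference.items() if predicted.get(f) == gold)
    return 100.0 * correct / len(reference)


# Hypothetical lexicons: one reference translation per source word.
reference = {"madre": "mother", "minuto": "minute",
             "silencio": "silence", "perfecto": "perfect"}
predicted = {"madre": "made", "minuto": "minute",
             "silencio": "silence", "perfecto": "perfect"}
accuracy = decipherment_accuracy(predicted, reference)
print(accuracy)
```

With three of the four source words mapped correctly, the sketch reports an accuracy of 75.0.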


7.3 Results 

We experimented with three versions of our log-linear decipherment algorithms: (1) Gibbs Sampling, (2) IMH and Gibbs Sampling, and (3) Contrastive Divergence (CD). To determine the impact of the orthographic features, the Contrastive Divergence based log-linear model was tested both with and without the orthographic features. We compared the log-linear models with the exact EM algorithm (Koehn and Knight, 2000; Ravi and Knight, 2011). We could not include the exact log-linear model in our experiments due to the extremely slow training. The number of iterations was fixed to 50 for all five methods. For the sampling based methods, we set the number of samples $n = 50$.

For the log-linear model with no orthographic features, we initialized all the feature weights to zero. We do not store these initial weights in memory, as they are all set to zero by default. When we included the orthographic features, we initialized the weight of the orthographic match feature to 1.0 to encourage translation pairs with high orthographic similarity. Furthermore, for each word pair $(f, e)$ with high orthographic similarity, we assigned a small positive weight (0.1). This initialization allowed the proposal distribution to sample orthographically similar target words for each source word. For the exact EM, we initialized the translation probabilities uniformly and stored the entire probability table.




























                              OPUS                Hansard-100         Hansard-1000
Method                        Time     Acc (%)    Time     Acc (%)    Time     Acc (%)
EM                            520.2s   6.04       188.0s   2.96       -        -
Log-linear + Gibbs            429.7s   8.63       207.3s   14.02      -        -
Log-linear + IMH + Gibbs      61.6s    8.46       39.0s    13.21      -        -
Log-linear + CD               15.1s    8.46       7.77s    12.93      401.0s   15.08
Log-linear + CD (No ortho)    15.3s    1.89       7.70s    3.50       396.4s   2.66

Table 4: The running time per iteration and accuracy of decipherment.


We applied all four log-linear models and the exact EM on the OPUS and the Hansard-100 datasets. On the Hansard-1000 dataset, we could only apply the Contrastive Divergence based log-linear model (with and without orthographic features) due to its large vocabulary sizes. Table 4 reports the accuracy and the running time per iteration for all the methods on the three datasets. The BLEU scores for the OPUS dataset are reported in Table 5. A bigram language model was used for all the models. Table 6 shows a few examples for which the log-linear model performed better due to orthographic features.

8 Discussion and Future Work 

We notice that all the log-linear models with orthographic features outperformed the EM-based methods. The only log-linear model which performed much worse was the one which lacked the orthographic features. This result emphasizes the importance of orthographic features for decipherment between closely related language pairs. The margin of improvement due to orthographic features was bigger for the Hansard datasets than for the OPUS dataset. This is expected, as the lexical similarity between French and English is higher than that between Spanish and English. The Contrastive Divergence based log-linear model achieved comparable accuracy to the two other log-linear models, despite being roughly 4 to 28 times faster per iteration (Table 4). Furthermore, the log-linear models resulted in better translations, as they obtained BLEU scores about 3 points higher than the EM-based methods on the OPUS dataset (Table 5).

While the orthographic features provide huge improvements in decipherment accuracy, they also introduce new errors. For example, the Spanish word “madre” means “mother” in English, but our model gave the highest score to the English word “made” due to the high orthographic similarity. However, such error cases are negligible compared to the improvement.

OPUS                      Hansard-1000
Spanish      English      French         English
excelente    excellent    criminel       criminal
minuto       minute       particulier    particular
silencio     silence      sociaux        social
perfecto     perfect      secteur        sector

Table 6: A few sample examples, for which orthographic features helped.

In this paper, we assumed no parallel data is available, and experimented with fairly simple initialization strategies. However, the objective functions for both EM and the latent variable log-linear model are non-convex, and the results may vary drastically based on initialization (Berg-Kirkpatrick and Klein, 2013). In the future, we would like to start with a small parallel corpus, and initialize the decipherment models with the parameters learned from it (Dou et al., 2014). We would also like to experiment with a more sophisticated translation model that incorporates NULL words, local reordering of neighboring words, and word fertilities (Ravi, 2013). Finally, we would like to incorporate more flexible non-local features, which are not supported by the feature-based directed graphical models, such as Feature-HMM (Berg-Kirkpatrick et al., 2010).


9 Conclusion 


We presented a feature-based decipherment system using latent variable log-linear models. The proposed models take advantage of the orthographic similarities between closely related languages, and outperform the existing EM-based models. The Contrastive Divergence based variant provided the best trade-off between speed and accuracy.